Abstract. We consider a general nonparametric regression model called the compound model. It includes,
as special cases, sparse additive regression and nonparametric (or linear) regression with many covariates
but possibly a small number of relevant covariates. The compound model is characterized by three main
parameters: the structure parameter describing the macroscopic form of the compound function, the
microscopic sparsity parameter indicating the maximal number of relevant covariates in each component,
and the usual smoothness parameter corresponding to the complexity of the members of the compound.
We find the non-asymptotic minimax rate of convergence of estimators in such a model as a function of these
three parameters. We also show that this rate can be attained in an adaptive way.
1. Introduction



High-dimensional statistical inference has seen tremendous development over the past ten
years, motivated by applications in various fields such as bioinformatics, computer vision, and financial
engineering. The most intensively investigated models in the context of high dimensionality are
the (generalized) linear models, for which efficient procedures are well known and the theoretical
properties are well understood (cf., for instance, [?, ?, 21]). More recently, increasing interest
has been shown in studying nonlinear models in the high-dimensional setting [15, 11, 3, ?] under
various types of sparsity assumptions. The present paper introduces a general framework that unifies
these studies and describes the theoretical limits of statistical procedures in high-dimensional
nonlinear problems.

In order to reduce the technicalities and focus on the main ideas, we consider the Gaussian white
noise model, which is known to be asymptotically equivalent, under some natural conditions, to
the model of regression [?, ?], as well as to other nonparametric models [8, 12]. Thus, we assume
that we observe a real-valued Gaussian process $Y = \{Y(\phi) : \phi \in L^2([0,1]^d)\}$ such that

$$ E_f[Y(\phi)] = \int_{[0,1]^d} f(x)\phi(x)\,dx, \qquad \mathrm{Cov}_f\big(Y(\phi), Y(\phi')\big) = \varepsilon^2 \int_{[0,1]^d} \phi(x)\phi'(x)\,dx, $$

for all $\phi, \phi' \in L^2([0,1]^d)$, where $f$ is an unknown function in $L^2([0,1]^d)$, $E_f$ and $\mathrm{Cov}_f$ are the
expectation and covariance signs, and $\varepsilon$ is some positive number. It is well known that these
two properties uniquely characterize the probability distribution of a Gaussian process that we
will further denote by $P_f$ (respectively, by $P_0$ if $f \equiv 0$). Alternatively, $Y$ can be considered as a
trajectory of the process

$$ dY(x) = f(x)\,dx + \varepsilon\,dW(x), \qquad x \in [0,1]^d, $$

where $W(x)$ is a $d$-parameter Brownian sheet. The parameter $\varepsilon$ is assumed known; in the model
of regression it corresponds to the quantity $\sigma n^{-1/2}$, where $\sigma^2$ is the variance of noise. Without
loss of generality, we assume in what follows that $0 < \varepsilon < 1$.

1.1. Notation 

First, we introduce some notation. Vectors in finite-dimensional spaces and infinite sequences will
be denoted by boldface letters, vector norms will be denoted by $|\cdot|$ while function norms will be
denoted by $\|\cdot\|$. Thus, for $v = (v_1, \dots, v_d) \in \mathbb{R}^d$ we set

$$ |v|_0 = \sum_{j=1}^{d} 1(v_j \neq 0), \qquad |v|_\infty = \max_{j=1,\dots,d} |v_j|, \qquad |v|_q^q = \sum_{j=1}^{d} |v_j|^q, \quad 1 \le q < \infty, $$

whereas for a function $f : [0,1]^d \to \mathbb{R}$ we set

$$ \|f\|_\infty = \sup_{x \in [0,1]^d} |f(x)|, \qquad \|f\|_q^q = \int_{[0,1]^d} |f(x)|^q\,dx, \quad 1 \le q < \infty. $$

We denote by $L^2_0([0,1]^d)$ the subspace of $L^2([0,1]^d)$ containing all the functions $f$ such that
$\int_{[0,1]^d} f(x)\,dx = 0$. The notation $\langle\cdot,\cdot\rangle$ will be used for the inner product in $L^2([0,1]^d)$, that is
$\langle h, \tilde h\rangle = \int_{[0,1]^d} h(x)\tilde h(x)\,dx$ for any $h, \tilde h \in L^2([0,1]^d)$. For two integers $a$ and $a'$, we denote by
$\llbracket a, a' \rrbracket$ the set of all integers belonging to the interval $[a, a']$. We denote by $[t]$ the integer part
of a real number $t$. For a finite set $V$, we denote by $|V|$ its cardinality. For a vector $x \in \mathbb{R}^d$
and a set of indices $V \subseteq \{1, \dots, d\}$, the vector $x_V \in \mathbb{R}^{|V|}$ is defined as the restriction of $x$ to
the coordinates with indices belonging to $V$. For every $s \in \{1, \dots, d\}$ and $m \in \mathbb{N}$, we define
$\mathcal{V}^d_s = \{V \subseteq \{1, \dots, d\} : |V| \le s\}$ and the set of binary vectors $\mathcal{B}^d_{s,m} = \{\eta \in \{0,1\}^{\mathcal{V}^d_s} : |\eta|_0 = m\}$.
We also use the notation $M_{d,s} = |\mathcal{V}^d_s|$. We extend these definitions to $s = 0$ by setting $\mathcal{V}^d_0 = \{\emptyset\}$,
$M_{d,0} = 1$, $|\mathcal{B}^d_{0,1}| = 1$, and $|\mathcal{B}^d_{0,m}| = 0$ for $m > 1$. For a vector $a$, we denote by $\mathrm{supp}(a)$ the
set of indices of its non-zero coordinates. In particular, the support $\mathrm{supp}(\eta)$ of a binary vector
$\eta = \{\eta_V\}_{V \in \mathcal{V}^d_s} \in \mathcal{B}^d_{s,m}$ is the set of $V$'s such that $\eta_V = 1$.
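
To make the combinatorial quantities above concrete, the following minimal sketch computes the number $M_{d,s}$ of index sets of size at most $s$ and the number of binary structure vectors with exactly $m$ ones. The function names `num_subsets` and `num_structures` are hypothetical and introduced only for illustration.

```python
from math import comb


def num_subsets(d: int, s: int) -> int:
    """M_{d,s}: number of subsets of {1, ..., d} with at most s elements."""
    return sum(comb(d, k) for k in range(s + 1))


def num_structures(d: int, s: int, m: int) -> int:
    """Number of binary vectors indexed by those subsets with exactly m ones."""
    return comb(num_subsets(d, s), m)


if __name__ == "__main__":
    print(num_subsets(5, 2))         # subsets of size 0, 1 or 2 of a 5-element set
    print(num_structures(5, 2, 2))   # ways to switch on exactly two of them
```

The conventions for $s = 0$ stated above are recovered automatically: there is one subset (the empty set), one structure vector with $m = 1$, and none with $m > 1$.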

1.2. Compound functional model

In this paper we impose the following assumption on the unknown function $f$.

Compound functional model: There exists an integer $s \in \{1, \dots, d\}$, a binary sequence $\eta \in
\mathcal{B}^d_{s,m}$, a set of functions $\{f_V \in L^2_0([0,1]^{|V|})\}_{V \in \mathcal{V}^d_s}$ and a constant $\bar f$ such that

$$ f(x) = \bar f + \sum_{V \in \mathcal{V}^d_s} f_V(x_V)\,\eta_V = \bar f + \sum_{V \in \mathrm{supp}(\eta)} f_V(x_V), \qquad \forall x \in \mathbb{R}^d. \tag{1} $$

The functions $f_V$ are called the atoms of the compound model.
Note that, under the compound model, $\bar f = \int_{[0,1]^d} f(x)\,dx$.
The atoms $f_V$ are assumed to be sufficiently regular, namely, each $f_V$ is an element of a suitable
functional class $\Sigma_V$. In particular, one can consider a smoothness class $\Sigma_V$ and more specifically
the Sobolev ball of functions of $s$ variables^1. In what follows, we will mainly deal with this example.

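Given a collection $\Sigma = \{\Sigma_V\}_{V \in \mathcal{V}^d_s}$ of subsets of $L^2_0([0,1]^s)$ and a subset $\tilde{\mathcal{B}}$ of $\mathcal{B}^d_{s,m}$, we define the
classes

$$ \mathcal{F}_{s,m}(\Sigma) = \bigcup_{\eta \in \tilde{\mathcal{B}}} \mathcal{F}_\eta(\Sigma), $$

where

$$ \mathcal{F}_\eta(\Sigma) = \Big\{ f : \mathbb{R}^d \to \mathbb{R} \ :\ \exists\, \bar f \in \mathbb{R},\ \{f_V\}_{V \in \mathrm{supp}(\eta)},\ f_V \in \Sigma_V, \text{ such that } f = \bar f + \sum_{V \in \mathrm{supp}(\eta)} f_V \Big\}. $$

The class $\mathcal{F}_{s,m}(\Sigma)$ is defined for any $s \in \{0, \dots, d\}$ and any $m \in \{0, \dots, M_{d,s}\}$. In what follows,
we assume that $\tilde{\mathcal{B}}$ is fixed and for this reason we do not include it in the notation. Examples of
$\tilde{\mathcal{B}}$ are the set of all $\eta \in \mathcal{B}^d_{s,m}$ such that the sets $V \in \mathrm{supp}(\eta)$ are pairwise disjoint, or the set of all $\eta \in \mathcal{B}^d_{s,m}$
such that every set $V$ from $\mathrm{supp}(\eta)$ has a non-empty intersection with at most one other set from
$\mathrm{supp}(\eta)$.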

It is clear from the definition that the parameters $(\eta, \{f_V\}_{V \in \mathrm{supp}(\eta)})$ are not identifiable.
Indeed, two different collections $(\eta, \{f_V\}_{V \in \mathrm{supp}(\eta)})$ and $(\bar\eta, \{\bar f_V\}_{V \in \mathrm{supp}(\bar\eta)})$ may lead to the same
compound function $f$. Of course, this is not necessarily an issue as long as only the problem of
estimating $f$ is considered.

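We now define the Sobolev classes of functions of many variables that will play the role of $\Sigma_V$.
Consider an orthonormal system of functions $\{\varphi_j\}_{j \in \mathbb{Z}^d}$ in $L^2([0,1]^d)$ such that $\varphi_0(x) \equiv 1$. We
assume that the system $\{\varphi_j\}$ and the set $\tilde{\mathcal{B}}$ are such that

$$ \Big\| \sum_{V \in \mathrm{supp}(\eta)} \ \sum_{j:\, j \neq 0,\ \mathrm{supp}(j) \subseteq V} \theta_{j,V}\,\varphi_j \Big\|_2^2 \ \le\ C_* \sum_{V \in \mathrm{supp}(\eta)} \ \sum_{j:\, j \neq 0,\ \mathrm{supp}(j) \subseteq V} \theta_{j,V}^2 \tag{2} $$

for all $\eta \in \tilde{\mathcal{B}}$ and all square-summable arrays $(\theta_{j,V},\ (j, V) \in \mathbb{Z}^d \times \mathcal{V}^d_s)$, where $C_* > 0$ is a constant
independent of $s$, $m$ and $d$. For example, this condition holds with $C_* = 1$ if $\tilde{\mathcal{B}}$ is the set of all
$\eta \in \mathcal{B}^d_{s,m}$ such that the sets $V \in \mathrm{supp}(\eta)$ are pairwise disjoint and with $C_* = 3/2$ if $\tilde{\mathcal{B}}$ is the set of all
$\eta \in \mathcal{B}^d_{s,m}$ such that every set $V$ from $\mathrm{supp}(\eta)$ has a non-empty intersection with at most one other
set from $\mathrm{supp}(\eta)$.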

One example of $\{\varphi_j\}_{j \in \mathbb{Z}^d}$ is a tensor product orthonormal basis:

$$ \varphi_j(x) = \bigotimes_{\ell=1}^{d} \varphi_{j_\ell}(x_\ell), \tag{3} $$

where $j = (j_1, \dots, j_d) \in \mathbb{Z}^d$ is a multi-index and $\{\varphi_k\}$, $k \in \mathbb{Z}$, is an orthonormal basis in $L^2([0,1])$.
Specifically, we can take the trigonometric basis with $\varphi_0(u) \equiv 1$ on $[0,1]$, $\varphi_k(u) = \sqrt{2}\cos(2\pi k u)$
for $k > 0$ and $\varphi_k(u) = \sqrt{2}\sin(2\pi k u)$ for $k < 0$. To ease notation, we set $\theta_j[f] = \langle f, \varphi_j\rangle$ for $j \in \mathbb{Z}^d$.
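
As an illustration of the tensor product trigonometric basis, the sketch below evaluates the basis functions and checks orthonormality numerically on a uniform grid. The helper names (`phi`, `phi_tensor`, `inner`) and the grid size are assumptions made for this example only; the midpoint rule used here is exact for trigonometric products whose frequencies stay below the grid size.

```python
import math
from itertools import product


def phi(k: int, u: float) -> float:
    """One-dimensional trigonometric basis function on [0, 1]."""
    if k == 0:
        return 1.0
    if k > 0:
        return math.sqrt(2.0) * math.cos(2.0 * math.pi * k * u)
    return math.sqrt(2.0) * math.sin(2.0 * math.pi * k * u)


def phi_tensor(j, x) -> float:
    """Tensor product basis function for the multi-index j at the point x."""
    return math.prod(phi(jl, xl) for jl, xl in zip(j, x))


def inner(j1, j2, n: int = 32) -> float:
    """Midpoint-rule approximation of the L2 inner product on [0, 1]^d."""
    d = len(j1)
    grid = [(i + 0.5) / n for i in range(n)]
    total = 0.0
    for x in product(grid, repeat=d):
        total += phi_tensor(j1, x) * phi_tensor(j2, x)
    return total / n**d


if __name__ == "__main__":
    print(inner((1, -2), (1, -2)))  # squared norm of one basis function
    print(inner((1, 0), (0, 1)))    # two distinct basis functions
```

The cost of the grid sum grows as $n^d$, so this check is only practical for very small $d$.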

For any set of indices $V \subseteq \{1, \dots, d\}$ and any $\beta > 0$, $L > 0$, we define the Sobolev class of functions

$$ W_V(\beta, L) = \Big\{ g \in L^2_0([0,1]^d) \ :\ g = \sum_{j \in \mathbb{Z}^d:\ \mathrm{supp}(j) \subseteq V} \theta_j[g]\,\varphi_j \quad \text{and} \quad \sum_{j \in \mathbb{Z}^d} |j|_\infty^{2\beta}\,\theta_j[g]^2 \le L \Big\}. \tag{4} $$

Assuming that $\{\varphi_j\}$ is the trigonometric basis and $f$ is periodic with period one in each coordinate,
i.e., $f(x + j) = f(x)$ for every $x \in \mathbb{R}^d$ and every $j \in \mathbb{Z}^d$, the condition $f_V \in W_V(\beta, L)$ can be
interpreted as the square integrability of all partial derivatives of $f_V$ up to the order $\beta$.

Let us give some examples of compound models. 



^1 Note that every function of fewer than $s$ variables can also be considered as a function of $s$ variables.






Dalalyan, Ingster and Tsybakov 



Additive models are the special case $s = 1$ of compound models. Here, additive models are
understood in a wider sense than originally defined by Stone [24]. Namely, for $s = 1$ we have
the model

$$ f(x) = \bar f + \sum_{j \in J} f_j(x_j), \qquad x = (x_1, \dots, x_d) \in \mathbb{R}^d, $$

where $J$ is any (unknown) subset of indices and not necessarily $J = \{1, \dots, d\}$. Estimation and
testing problems in this model when the atoms belong to some smoothness classes have been
studied in Ingster and Lepski [13], Meier et al. [?], Koltchinskii and Yuan [15], Raskutti et al.
[20], Gayraud and Ingster [11], Suzuki [25].

Single atom models are the special case $m = 1$ of compound models. If $m = 1$ we have
$f(x) = f_V(x_V)$ for some unknown $V \subseteq \{1, \dots, d\}$, i.e., there exists only one set $V$ for which
$\eta_V = 1$, and $|V| \le s$. Estimation and variable selection in this model were considered by Bertin
and Lecué [2], Comminges and Dalalyan [6], Lorenzo Rosasco [?]. The case of small $s$ and
large $d$ is particularly interesting in the context of sparsity. In a parametric model, when $f_V$ is
a linear function, we are back to the sparse high-dimensional linear regression setting, which
has been extensively studied, see, e.g., [27].



Tensor product models. Let $A$ be a given finite subset of $\mathbb{Z}$, and assume that $\varphi_j$ is a tensor
product basis defined by (3). Consider the following parametric class of functions

$$ T_\eta(A) = \Big\{ f : \mathbb{R}^d \to \mathbb{R} \ :\ \exists\, \bar f,\ \{\theta_{j,V}\}, \text{ such that } f = \bar f + \sum_{V \in \mathrm{supp}(\eta)} \ \sum_{j \in J_{V,A}} \theta_{j,V}\,\varphi_j \Big\}, \tag{5} $$

where

$$ J_{V,A} = \{ j \in A^d : \mathrm{supp}(j) \subseteq V \}. \tag{6} $$

We say that a function $f$ satisfies the tensor product model if it belongs to the set $T_\eta(A)$ for
some $\eta \in \tilde{\mathcal{B}}$. We define

$$ \mathcal{F}_{s,m}(T_A) = \bigcup_{\eta \in \tilde{\mathcal{B}}} T_\eta(A). $$

Important examples are sparse high-dimensional multilinear/polynomial systems. Motivated
respectively by applications in genetics and signal processing, they have been recently studied
by Nazer and Nowak [19] in the context of compressed sensing without noise and by Kekatos
and Giannakis [14] in the case where the observations are corrupted by a Gaussian noise.
With our notation, the models they considered are the tensor product models with $A = \{0, 1\}$
(linear basis functions $\varphi_j$) in the multilinear model of [19] and $A = \{-1, 0, 1\}$ in the Volterra
filtering problem of [14] (second-order Volterra systems with $\varphi_0(x) = 1$, $\varphi_1(x) \propto (x - 1/2)$ and
$\varphi_{-1}(x) \propto x^2 - x + 1/6$). More generally, the set $A$ should be of small cardinality to guarantee
efficient dimension reduction. Another approach is to introduce hierarchical structures on the
coefficients of tensor product representation [?, ?].



In what follows, we assume that $f$ belongs to the functional class $\mathcal{F}_{s,m}(\Sigma)$ where either $\Sigma =
\{W_V(\beta, L)\}_{V \in \mathcal{V}^d_s} \triangleq W(\beta, L)$ or $\Sigma = T_A$.

The compound model is described by three main parameters, which are the dimension $m$ that
we call the macroscopic parameter and that characterizes the complexity of possible structure
vectors $\eta$, the dimension $s$ of atoms in the compound that we call the microscopic parameter, and
the complexity of the functional class $\Sigma$. The latter can be described by entropy numbers of $\Sigma$ in
convenient norms, and in the particular case of Sobolev classes, it is naturally characterized by the
smoothness parameter $\beta$. The integers $m$ and $s$ are "effective dimension" parameters. As
they grow, the structure becomes less pronounced and the compound model approaches the global
nonparametric regression in dimension $d$, which is known to suffer from the curse of dimensionality
already for moderate $d$. Therefore, an interesting case is the sparsity scenario where $s$ and/or $m$
are small.
2. Overview of the results and relation to the previous work



Several statistical problems arise naturally in the context of the compound functional model.

Estimation of $f$. This is the subject of the present paper. We measure the risk of an arbitrary
estimator $\hat f_\varepsilon$ by its mean integrated squared error $E_f[\|\hat f_\varepsilon - f\|_2^2]$ and we study the minimax
risk

$$ \inf_{\hat f_\varepsilon} \ \sup_{f \in \mathcal{F}_{s,m}(\Sigma)} E_f[\|\hat f_\varepsilon - f\|_2^2], $$

where $\inf_{\hat f_\varepsilon}$ denotes the minimum over all estimators. A first general question is to establish
the minimax rates of estimation, i.e., to find values $\psi_{s,m,\varepsilon}(\Sigma)$ such that

$$ \inf_{\hat f_\varepsilon} \ \sup_{f \in \mathcal{F}_{s,m}(\Sigma)} E_f[\|\hat f_\varepsilon - f\|_2^2] \asymp \psi_{s,m,\varepsilon}(\Sigma), $$

when $\Sigma$ is a Sobolev, Hölder or other class of functions. A second question is to construct
optimal estimators in a minimax sense, i.e., estimators $\hat f_\varepsilon$ such that

$$ \sup_{f \in \mathcal{F}_{s,m}(\Sigma)} E_f[\|\hat f_\varepsilon - f\|_2^2] \le C\,\psi_{s,m,\varepsilon}(\Sigma), \tag{7} $$

for some constant $C$ independent of $s$, $m$, $\varepsilon$ and $\Sigma$. Some results on minimax rates of estimation
of $f$ are available only for the case $s = 1$ (cf. the discussion below). Finally, a third question
that we address here is whether the optimal rate can be attained adaptively, i.e., whether
one can construct an estimator $\hat f_\varepsilon$ that satisfies (7) simultaneously for all $s$, $m$, $\beta$ and $L$ when
$\Sigma = W(\beta, L)$. We will show that the answer to this question is positive.

Variable selection. Assume that $m = 1$. This means that $f(x) = f_V(x_V)$ for some unknown
$V \subseteq \{1, \dots, d\}$, i.e., there exists only one set $V$ for which $\eta_V = 1$ (a single atom model). Then
it is of interest to identify $V$ under the constraint $|V| \le s$. In particular, $d$ can be very large
while $s$ can be small. This corresponds to estimating the relevant covariates and generalizes
the problem of selection of the sparsity pattern in linear regression. An estimator $\hat V_n \subseteq \{1, \dots, d\}$
of $V$ is considered as good if the probability $P(\hat V_n = V)$ is close to one.

Hypotheses testing (detection). The problem is to test the hypothesis $H_0 : f \equiv 0$ (no signal)
against the alternative $H_1 : f \in \mathcal{A}$, where $\mathcal{A} = \{f \in \mathcal{F}_{s,m}(\Sigma) : \|f\|_2 \ge r\}$. Here, it is
interesting to characterize the minimax rates of separation $r > 0$ in terms of $s$, $m$ and $\Sigma$.



Some of the above three problems have been studied in the literature for the special cases $s = 1$
(additive model) and $m = 1$ (single atom model). Ingster and Lepski [13] studied the problem of
testing in the additive model and provided asymptotic minimax rates of separation. Sharp asymptotic
optimality under additional assumptions in the same problem was obtained by Gayraud and Ingster
[11]. Recently, Comminges and Dalalyan [6] established tight conditions for variable selection in
the single atom model. We also mention an earlier work of Bertin and Lecué [2] dealing with
variable selection.



The problem of estimation has also been considered for the additive model and a class $\Sigma$ defined as a
reproducing kernel Hilbert space, cf. Koltchinskii and Yuan [15], Raskutti et al. [20]. In particular,
these papers showed that if $s = 1$ and $\Sigma = W(\beta, L)$ is a Sobolev class, then there is an estimator
of $f$ for which the mean integrated squared error converges to zero at the rate

$$ \max\big( m\,\varepsilon^{4\beta/(2\beta+1)},\ m\,\varepsilon^2 \log d \big). \tag{8} $$

Furthermore, Raskutti et al. [20, Thm. 2] provided the following lower bound on the minimax risk:

$$ \max\Big( m\,\varepsilon^{4\beta/(2\beta+1)},\ m\,\varepsilon^2 \log\Big(\frac{d}{m}\Big) \Big). \tag{9} $$

Note that when $m$ is proportional to $d$, this lower bound departs from the upper bound in a
logarithmic way. It should also be noted that the upper bounds in these papers are achieved by
estimators that are not adaptive in the sense that they require the knowledge of the smoothness
index $\beta$.

In this paper, we establish non-asymptotic upper and lower bounds on the minimax risk for the
model with Sobolev smoothness class $\Sigma = W(\beta, L)$. We will prove that, up to a multiplicative
constant, the minimax risk behaves as

$$ \max\Big\{ m L^{s/(2\beta+s)} \varepsilon^{4\beta/(2\beta+s)},\ m s\,\varepsilon^2 \log\Big(\frac{d}{s\,m^{1/s}}\Big) \Big\} \wedge L \tag{10} $$

(we assume here $d/(s\,m^{1/s}) > 1$, otherwise a constant factor greater than 1 should be inserted
under the logarithm, cf. the results below). In addition, we demonstrate that this rate can be
reached in an adaptive way, that is, without the knowledge of $\beta$, $s$, and $m$. The rate (10) is
non-asymptotic, which explains, in particular, the presence of the minimum with the constant $L$ in (10). For
$s = 1$, i.e., for the additive regression model, our rate matches the lower bound of [20].

For $m = 1$, i.e., when $f(x) = f_V(x_V)$ for some unknown $V \subseteq \{1, \dots, d\}$ (the single atom model),
the minimax rate of convergence takes the form

$$ \max\Big\{ L^{s/(2\beta+s)} \varepsilon^{4\beta/(2\beta+s)},\ s\,\varepsilon^2 \log\Big(\frac{d}{s}\Big) \Big\} \wedge L. \tag{11} $$

This rate accounts for two effects, namely, the accuracy of nonparametric estimation of $f$ for a fixed
macroscopic structure parameter $\eta$, cf. the first term $\sim \varepsilon^{4\beta/(2\beta+s)}$, and the complexity of the
structure itself (irrespective of the nonparametric nature of the microscopic components $f_V(x_V)$). In
particular, the second term $\sim s\,\varepsilon^2 \log(d/s)$ in (11) coincides with the optimal rate of prediction
in the linear regression model under the standard sparsity assumption. This is what we obtain in the
limiting case when $\beta$ tends to infinity. It is important to note that the optimal rates depend only
logarithmically on the ambient dimension $d$. Thus, even if $d$ is large, the rate-optimal estimators
achieve good performance under the sparsity scenario when $s$ and $m$ are small.
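
To see how the two terms of the rate trade off, the following minimal sketch evaluates the expression $\max\{ m L^{s/(2\beta+s)} \varepsilon^{4\beta/(2\beta+s)},\ m s \varepsilon^2 \log(d/(s m^{1/s})) \} \wedge L$ numerically. The function name `minimax_rate` is hypothetical, the multiplicative constant is ignored, and the sketch assumes $d/(s\,m^{1/s}) > 1$ so that the logarithm is positive.

```python
import math


def minimax_rate(beta: float, L: float, s: int, m: int, d: int, eps: float) -> float:
    """Rate max{nonparametric term, structure term} capped at L (constants ignored)."""
    ratio = d / (s * m ** (1.0 / s))
    if ratio <= 1.0:
        raise ValueError("this sketch assumes d / (s * m**(1/s)) > 1")
    nonparametric = m * L ** (s / (2 * beta + s)) * eps ** (4 * beta / (2 * beta + s))
    structure = m * s * eps**2 * math.log(ratio)
    return min(max(nonparametric, structure), L)


if __name__ == "__main__":
    # Single atom (m = 1) of one variable among d = 100 candidates.
    print(minimax_rate(beta=2.0, L=1.0, s=1, m=1, d=100, eps=0.1))
    # Same setting with a much larger ambient dimension: only the log term changes.
    print(minimax_rate(beta=2.0, L=1.0, s=1, m=1, d=10**6, eps=0.1))
```

In the first call the structure term $\varepsilon^2 \log d$ dominates the nonparametric term $\varepsilon^{8/5}$, which illustrates that the price of not knowing the relevant covariate can exceed the price of not knowing the function.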



3. The estimator and upper bounds on the minimax risk 



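In this section, we suggest an estimator attaining the minimax rate. It is constructed in the
following two steps.

Constructing weak estimators. At this step, we proceed as if the macroscopic structure parameter
$\eta$ were known. The goal is to provide for each $\eta$ a family of "simple" estimators
of $f$ containing a rate-minimax one. To this end, we project $Y$ onto the basis functions
$\{\varphi_j : |j|_\infty \le \varepsilon^{-2}\}$ and denote

$$ Y_\varepsilon = \big( Y_j \triangleq Y(\varphi_j) \ :\ j \in \mathbb{Z}^d,\ |j|_\infty \le \varepsilon^{-2} \big). $$

We denote by $V_1, \dots, V_m$ the elements of the
support of $\eta$. Let us fix an integer-valued vector $t = (t_{V_\ell},\ \ell = 1, \dots, m) \in \llbracket 0, \varepsilon^{-2} \rrbracket^m$ and set
$\hat\theta_{t,\eta} = (\hat\theta_{t,\eta,j} : j \in \mathbb{Z}^d,\ |j|_\infty \le \varepsilon^{-2})$, where $\hat\theta_{t,\eta,0} = Y_0$ and

$$ \hat\theta_{t,\eta,j} = \begin{cases} Y_j, & \exists\, \ell \ \text{s.t.}\ \mathrm{supp}(j) \subseteq V_\ell,\ |j|_\infty \in \llbracket 1, t_{V_\ell} \rrbracket, \\ 0, & \text{otherwise} \end{cases} $$

if $j \neq 0$. Based on these estimators of the coefficients of $f$, we recover the function $f$ using the
estimator

$$ \hat f_{t,\eta}(x) = \sum_{j \in \mathbb{Z}^d:\ |j|_\infty \le \varepsilon^{-2}} \hat\theta_{t,\eta,j}\,\varphi_j(x). $$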
Smoothness- and structure-adaptive estimation. The goal in this step is to combine the
weak estimators $\{\hat f_{t,\eta}\}_{t,\eta}$ in order to get a structure- and smoothness-adaptive estimator of $f$
with a risk which is as small as possible. To this end, we use a version of the exponentially weighted
aggregate [?, ?] in the spirit of sparsity pattern aggregation as described in [22, 23]. More
precisely, for every pair of integers $(s, m)$ such that $s \in \{1, \dots, d\}$ and $m \in \{1, \dots, M_{d,s}\}$, we
define prior probabilities for $(t, \eta) \in \llbracket 0, \varepsilon^{-2} \rrbracket^m \times (\mathcal{B}^d_{s,m} \setminus \mathcal{B}^d_{s-1,m})$ by

$$ \pi_{t,\eta} = \frac{2^{-sm}}{H_d\,(1 + [\varepsilon^{-2}])^m\,|\mathcal{B}^d_{s,m} \setminus \mathcal{B}^d_{s-1,m}|}, \qquad H_d = \sum_{s=0}^{d} \sum_{m=1}^{M_{d,s}} 2^{-sm} \le e. \tag{12} $$
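
The bound on the normalizing constant can be checked numerically. The sketch below assumes that the constant has the form $H_d = \sum_{s=0}^{d} \sum_{m=1}^{M_{d,s}} 2^{-sm}$, with $M_{d,0} = 1$, and evaluates each inner sum through the closed form of a finite geometric series, since $M_{d,s}$ is far too large to enumerate. The function names are hypothetical.

```python
from math import comb, e


def num_subsets(d: int, s: int) -> int:
    """M_{d,s}: number of subsets of {1, ..., d} with at most s elements."""
    return sum(comb(d, k) for k in range(s + 1))


def normalizing_constant(d: int) -> float:
    """H_d, assuming H_d = sum_{s=0}^{d} sum_{m=1}^{M_{d,s}} 2^{-s m}."""
    total = 0.0
    for s in range(d + 1):
        n_terms = num_subsets(d, s)
        if s == 0:
            total += n_terms  # each of the M_{d,0} = 1 terms equals 2^0 = 1
            continue
        q = 2.0 ** (-s)
        # q**n_terms is numerically zero once s * n_terms is large.
        tail = 0.0 if s * n_terms > 1000 else q**n_terms
        total += q * (1.0 - tail) / (1.0 - q)
    return total


if __name__ == "__main__":
    for d in (1, 5, 30):
        print(d, normalizing_constant(d), normalizing_constant(d) <= e)
```

Under this assumed form, the $s = 0$ term contributes 1 and the remaining terms are bounded by $\sum_{s \ge 1} 1/(2^s - 1) \approx 1.607$, which is why the total stays below $e$.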

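For $s = 0$ and the unique $\eta \in \mathcal{B}^d_{0,1}$ we consider only one weak estimator $\hat\theta_{t,\eta}$ with all entries
zero except for the entry $\hat\theta_{t,\eta,0}$, which is equal to $Y_0$. We set $\pi_{t,\eta} = 1/H_d$. It is easy to see
that $\pi = \big( \pi_{t,\eta};\ (t, \eta) \in \bigcup_{s,m} \{\llbracket 0, \varepsilon^{-2} \rrbracket^m \times \mathcal{B}^d_{s,m}\} \big)$ defines a probability distribution. For any
pair $(t, \eta)$ we introduce the penalty function

$$ \mathrm{pen}(t, \eta) = 2\varepsilon^2 \sum_{V \in \mathrm{supp}(\eta)} (2 t_V + 1)^{|V|} $$

and define the vector of coefficients $\hat\theta_\varepsilon = (\hat\theta_{\varepsilon,j} : j \in \mathbb{Z}^d,\ |j|_\infty \le \varepsilon^{-2})$ by

$$ \hat\theta_\varepsilon = \sum_{s=0}^{d} \sum_{m=1}^{M_{d,s}} \sum_{(t,\eta)} \hat\theta_{t,\eta}\ \frac{\exp\big\{ -\frac{1}{4\varepsilon^2} \big( |Y_\varepsilon - \hat\theta_{t,\eta}|_2^2 + \mathrm{pen}(t, \eta) \big) \big\}\,\pi_{t,\eta}}{\sum_{\bar s=0}^{d} \sum_{\bar m=1}^{M_{d,\bar s}} \sum_{(\bar t,\bar\eta)} \exp\big\{ -\frac{1}{4\varepsilon^2} \big( |Y_\varepsilon - \hat\theta_{\bar t,\bar\eta}|_2^2 + \mathrm{pen}(\bar t, \bar\eta) \big) \big\}\,\pi_{\bar t,\bar\eta}}, \tag{13} $$

where the summations $\sum_{(t,\eta)}$ and $\sum_{(\bar t,\bar\eta)}$ correspond to $(t, \eta) \in \llbracket 0, \varepsilon^{-2} \rrbracket^m \times (\mathcal{B}^d_{s,m} \setminus \mathcal{B}^d_{s-1,m})$
and $(\bar t, \bar\eta) \in \llbracket 0, \varepsilon^{-2} \rrbracket^{\bar m} \times (\mathcal{B}^d_{\bar s,\bar m} \setminus \mathcal{B}^d_{\bar s-1,\bar m})$, respectively. The final estimator of $f$ is

$$ \hat f_\varepsilon(x) = \sum_{j \in \mathbb{Z}^d:\ |j|_\infty \le \varepsilon^{-2}} \hat\theta_{\varepsilon,j}\,\varphi_j(x). $$

Note that each $\hat\theta_{t,\eta}$ is a projection estimator of the vector $\theta = (\theta_j[f])_{j \in \mathbb{Z}^d}$. Hence, $\hat f_\varepsilon$ is a convex
combination of projection estimators. We also note that, to construct $\hat f_\varepsilon$, we only need to know
$\varepsilon$ and $d$. Therefore, the estimator is adaptive to all other parameters of the model, such as $s$, $m$,
the parameters that define the class $\Sigma$ and the choice of a particular subset $\tilde{\mathcal{B}}$ of $\mathcal{B}^d_{s,m}$.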
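
The aggregation step can be illustrated on a toy sequence model. The sketch below is not the full procedure: it aggregates a small, explicitly listed family of projection estimators, each described by a 0/1 mask over the observed coefficients, and it assumes a penalty equal to $2\varepsilon^2$ times the number of retained coefficients. The names `exponential_weights` and `aggregate` are hypothetical.

```python
import math


def exponential_weights(y, masks, prior, eps):
    """Normalized exponential weights of the projection estimators given by masks."""
    log_w = []
    for mask, p in zip(masks, prior):
        # |Y - theta_hat|^2 for a projection estimator: energy of dropped coordinates.
        residual = sum(yj * yj for yj, keep in zip(y, mask) if not keep)
        penalty = 2.0 * eps**2 * sum(mask)  # assumed penalty: 2 eps^2 * dimension
        log_w.append(math.log(p) - (residual + penalty) / (4.0 * eps**2))
    shift = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(v - shift) for v in log_w]
    total = sum(w)
    return [wi / total for wi in w]


def aggregate(y, masks, prior, eps):
    """Convex combination of the projection estimators with exponential weights."""
    w = exponential_weights(y, masks, prior, eps)
    return [
        yj * sum(wi for wi, mask in zip(w, masks) if mask[j])
        for j, yj in enumerate(y)
    ]


if __name__ == "__main__":
    y = [5.0, 0.1, 0.0]
    masks = [(1, 0, 0), (0, 1, 0)]
    print(exponential_weights(y, masks, [0.5, 0.5], eps=1.0))
    print(aggregate(y, masks, [0.5, 0.5], eps=1.0))
```

In this toy example nearly all the weight goes to the mask that keeps the large first coefficient, so the aggregate behaves almost like the best projection estimator in the family.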

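The following theorem gives an upper bound on the risk of the estimator $\hat f_\varepsilon$ when $\Sigma = W(\beta, L)$.

Theorem 1. Let $\beta > 0$ and $L > 0$ be such that $\log(\varepsilon^{-2}) \ge (2\beta)^{-1} \log(L)$ and $L > \varepsilon^2 \log(\varepsilon^{-2})^{(2\beta+s)/s}$.
Let $\tilde{\mathcal{B}}$ be any subset of $\mathcal{B}^d_{s,m}$. Assume that condition (2) holds. Then, for some constant $C(\beta) > 0$
depending only on $\beta$ we have

$$ \sup_{f \in \mathcal{F}_{s,m}(W(\beta,L))} E_f[\|\hat f_\varepsilon - f\|_2^2] \ \le\ (6L) \wedge \Big( m \Big\{ C(\beta)\,L^{s/(2\beta+s)} \varepsilon^{4\beta/(2\beta+s)} + 4 s\,\varepsilon^2 \log\Big( \frac{2e^3 d}{s\,m^{1/s}} \Big) \Big\} \Big). \tag{14} $$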



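Proof. Since the functions $\varphi_j$ are orthonormal, $Y_\varepsilon$ is composed of independent Gaussian random
variables with common variance equal to $\varepsilon^2$. Therefore, using [16, Cor. 6] we obtain that the
estimator $\hat f_\varepsilon$ satisfies, for all $f$,

$$ E_f[\|\hat f_\varepsilon - f\|_2^2] \ \le\ \min_{t,\eta} \Big\{ E_f[\|\hat f_{t,\eta} - f\|_2^2] + 4\varepsilon^2 \log(\pi_{t,\eta}^{-1}) \Big\}, \tag{15} $$

where the minimum is taken over all $(t, \eta) \in \bigcup_{s,m} \{\llbracket 0, \varepsilon^{-2} \rrbracket^m \times \mathcal{B}^d_{s,m}\}$. Denote by $\eta_0$ the unique
element of $\mathcal{B}^d_{0,1}$ for which $\mathrm{supp}(\eta_0) = \{\emptyset\}$. The corresponding estimator $\hat f_{t,\eta_0}$ coincides with the
constant function equal to $Y_0$ and its risk is bounded by $\varepsilon^2 + L$ for all $f \in \mathcal{F}_{s,m}(W(\beta, L))$.
Therefore,

$$ \sup_{f \in \mathcal{F}_{s,m}(W(\beta,L))} E_f[\|\hat f_\varepsilon - f\|_2^2] \ \le \sup_{f \in \mathcal{F}_{s,m}(W(\beta,L))} E_f[\|\hat f_{t,\eta_0} - f\|_2^2] + 4\varepsilon^2 \log(\pi_{t,\eta_0}^{-1}) \ \le\ \varepsilon^2 + L + 4\varepsilon^2 \ \le\ 6L. \tag{16} $$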

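Take now any $f \in \mathcal{F}_{s,m}(W(\beta, L))$, and let $\eta^* \in \tilde{\mathcal{B}} \subseteq \mathcal{B}^d_{s,m}$ be such that $f \in \mathcal{F}_{\eta^*}(W(\beta, L))$. Then
it follows from (15) that

$$ \begin{aligned} E_f[\|\hat f_\varepsilon - f\|_2^2] &\le \min_{t \in \llbracket 0, \varepsilon^{-2} \rrbracket^m} \Big( E_f[\|\hat f_{t,\eta^*} - f\|_2^2] + 4\varepsilon^2 \log(\pi_{t,\eta^*}^{-1}) \Big) \\ &\le \min_{t \in \llbracket 0, \varepsilon^{-2} \rrbracket^m} E_f[\|\hat f_{t,\eta^*} - f\|_2^2] + 4\varepsilon^2 \big( m \log(2\varepsilon^{-2}) + m s \log(2) + \log(e\,|\mathcal{B}^d_{s,m}|) \big). \end{aligned} \tag{17} $$

Note that for all $d, s \in \mathbb{N}$ such that $s \le d$ we have

$$ M_{d,s} = \sum_{\ell=0}^{s} \binom{d}{\ell} \le \Big( \frac{ed}{s} \Big)^{s}, \qquad |\mathcal{B}^d_{s,m}| = \binom{M_{d,s}}{m} \le \Big( \frac{e\,M_{d,s}}{m} \Big)^{m}. \tag{18} $$

Also, we have the following bound on the risk of the estimator $\hat f_{t,\eta}$ for each $\eta \in \tilde{\mathcal{B}}$ and for an
appropriate choice of the bandwidth parameter $t \in \llbracket 0, \varepsilon^{-2} \rrbracket^m$.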



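Lemma 1. Let $\beta > 0$, $L > \varepsilon^2$ be such that $\log(\varepsilon^{-2}) \ge (2\beta)^{-1} \log(L)$. Let $t \in \llbracket 0, \varepsilon^{-2} \rrbracket^m$ be a vector
with integer coordinates $t_{V_\ell} = \big[ (L/(3^{|V_\ell|} \varepsilon^2))^{1/(2\beta+|V_\ell|)} \wedge \varepsilon^{-2} \big]$, $\ell = 1, \dots, m$. Assume that condition
(2) holds. Then, for some constant $C_\beta > 0$ depending only on $\beta$,

$$ \sup_{f \in \mathcal{F}_\eta(W(\beta,L))} E_f[\|\hat f_{t,\eta} - f\|_2^2] \ \le\ C_\beta\, m\, L^{s/(2\beta+s)} \varepsilon^{4\beta/(2\beta+s)}, \qquad \forall \eta \in \tilde{\mathcal{B}}. \tag{19} $$

Proof of this lemma is given in the appendix.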

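Combining (17) with (18) and (19) yields the following upper bound on the risk of $\hat f_\varepsilon$:

$$ \sup_{f \in \mathcal{F}_{s,m}(W(\beta,L))} E_f[\|\hat f_\varepsilon - f\|_2^2] \ \le\ m \Big( C_\beta L^{s/(2\beta+s)} \varepsilon^{4\beta/(2\beta+s)} + 4\varepsilon^2 \log(2\varepsilon^{-2}) + 4 s\,\varepsilon^2 \log\Big( \frac{2e^3 d}{s\,m^{1/s}} \Big) \Big), $$

where $C_\beta > 0$ is a constant depending only on $\beta$. The assumptions of the theorem guarantee that
$\varepsilon^2 \log(2\varepsilon^{-2}) \le L^{s/(2\beta+s)} \varepsilon^{4\beta/(2\beta+s)}$, so that the desired result follows from (16) and the last display.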

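The behavior of the estimator $\hat f_\varepsilon$ in the case $\Sigma = T_A$ is described in the next theorem.
Theorem 2. Assume that $k = \max\{|\ell| : \ell \in A\} < \varepsilon^{-2}$. Then

$$ \sup_{f \in \mathcal{F}_{s,m}(T_A)} E_f[\|\hat f_\varepsilon - f\|_2^2] \ \le\ m\,\varepsilon^2 \Big\{ (2k+1)^s + 4 \log(2\varepsilon^{-2}) + 4 s \log\Big( \frac{2e^3 d}{s\,m^{1/s}} \Big) \Big\}. \tag{20} $$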



Proof of Theorem 2 follows the same lines as that of Theorem 1. We take $f \in \mathcal{F}_{s,m}(T_A)$, and let
$\eta^* \in \tilde{\mathcal{B}} \subseteq \mathcal{B}^d_{s,m}$ be such that $f \in T_{\eta^*}(A)$. Let $t^* \in \mathbb{R}^m$ be the vector with all coordinates equal
to $k$. Then the same argument as in (17) yields

$$ E_f[\|\hat f_\varepsilon - f\|_2^2] \ \le\ E_f[\|\hat f_{t^*,\eta^*} - f\|_2^2] + 4\varepsilon^2 \big( m \log(2\varepsilon^{-2}) + m s \log(2) + \log(e\,|\mathcal{B}^d_{s,m}|) \big). \tag{21} $$

We can write $\mathrm{supp}(\eta^*) = \{V_1, \dots, V_m\}$ where $|V_\ell| \le s$. Since the model is parametric, there is no
bias term in the expression for the risk on the right hand side of (21) and we have (cf. (28)):

$$ E_f[\|\hat f_{t^*,\eta^*} - f\|_2^2] \ \le\ \sum_{\ell=1}^{m} \ \sum_{j:\ \mathrm{supp}(j) \subseteq V_\ell} \varepsilon^2\, 1\{|j|_\infty \le k\} \ \le\ m\,\varepsilon^2 (2k+1)^s. $$

Together with (18), this implies (20).

The bound of Theorem 2 is particularly interesting when $k$ and $s$ are small. For the examples of
multilinear and polynomial systems [19, 14] we have $k = 1$. We also note that the result is much
better than what can be obtained by using the Lasso. Indeed, consider the simplest case of the single
atom tensor product model ($m = 1$). Since we do not know $s$, we need to run the Lasso in the
dimension $p = k^d$ and we can only guarantee the rate $\varepsilon^2 \log p = d\,\varepsilon^2 \log k$, which is linear in the
dimension $d$. If $d$ is very large and $s \ll d$, this is much slower than the rate of Theorem 2.



4. Lower bound 



In this section, we prove a minimax lower bound on the risk of any estimator over the class
$\mathcal{F}_{s,m}(W(\beta, L))$. We will assume that $\{\varphi_j\}$ is the tensor-product trigonometric basis and $\tilde{\mathcal{B}} = \tilde{\mathcal{B}}^d_{s,m}$,
where $\tilde{\mathcal{B}}^d_{s,m}$ denotes the set of all $\eta \in \mathcal{B}^d_{s,m}$ such that the sets $V \in \mathrm{supp}(\eta)$ are disjoint. Then
condition (2) holds with equality and $C_* = 1$. We will split the proof into two steps. First, we
establish a lower bound on the minimax risk in the case of known structure $\eta$, i.e., when $f$ belongs
to the class $\mathcal{F}_\eta(W(\beta, L))$ for some known parameters $\eta \in \tilde{\mathcal{B}}$ and $\beta, L > 0$. We will show that the
minimax risk tends to zero at a rate not faster than $m\,\varepsilon^{4\beta/(2\beta+s)}$. In a second step, we will prove
that if $\eta$ is unknown, then the minimax rate is bounded from below by $m s\,\varepsilon^2 \log(d/(s\,m^{1/s}))$
if the function $f$ belongs to $\mathcal{F}_{s,m}(\Theta)$ for a set $\Theta$ spanned by the tensor products involving only the
functions $\varphi_1$ and $\varphi_{-1}$ of various arguments.



4.1. Lower bound for known structure $\eta$



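Proposition 1. Let $\{\varphi_j\}$ be the tensor-product trigonometric basis and let $s, m, d$ be positive
integers satisfying $d \ge sm$. Assume that $L > \varepsilon^2$. Then there exists an absolute constant $C > 0$ such
that

$$ \inf_{\hat f} \ \sup_{f \in \mathcal{F}_\eta(W(\beta,L))} E_f[\|\hat f - f\|_2^2] \ \ge\ C\, m\, L^{s/(2\beta+s)} \varepsilon^{4\beta/(2\beta+s)}, \qquad \forall \eta \in \tilde{\mathcal{B}}^d_{s,m}. $$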

Proof. Without loss of generality assume that $m = 1$. We will also assume that $L = 1$ (this is
without loss of generality as well, since we can replace $\varepsilon$ by $\varepsilon/\sqrt{L}$ and by our assumption this
quantity is less than 1). After a renumbering if needed, we can assume that $\eta$ is such that $\eta_V = 1$
for $V = \{1, \dots, s\}$ and $\eta_V = 0$ for $V \neq \{1, \dots, s\}$.

Let $t$ be an integer not smaller than 4. Then, the set $I$ of all multi-indices $k \in \mathbb{Z}^s$
satisfying $|k|_\infty \le t$ is of cardinality $|I| \ge 9$. For any $\omega = (\omega_k, k \in I) \in \{0, 1\}^I$, we set
$f_\omega(x) = \gamma \sum_{k \in I} \omega_k \varphi_k(x_1, \dots, x_s)$, where $\varphi_k(x_1, \dots, x_s) = \prod_{j=1}^{s} \varphi_{k_j}(x_j)$, $k = (k_1, \dots, k_s)$, is
an element of the tensor-product trigonometric basis and $\gamma > 0$ is a parameter to be chosen later.
In view of the orthonormality of the basis functions $\varphi_k$, we have

$$ \|f_\omega\|_2^2 = \gamma^2 |\omega|_1, \qquad \forall \omega \in \{0, 1\}^I. \tag{22} $$

Therefore, we have $\sum_k |k|_\infty^{2\beta}\,\theta_k[f_\omega]^2 \le t^{2\beta} \|f_\omega\|_2^2 \le \gamma^2 t^{2\beta} (2t+1)^s \le \gamma^2 (2t+1)^{2\beta+s}$. Thus, the
condition $\gamma^2 (2t+1)^{2\beta+s} \le 1$ ensures that all the functions $f_\omega$ belong to $W(\beta, 1)$.
Furthermore, for two vectors $\omega, \omega' \in \{0, 1\}^I$ we have $\|f_\omega - f_{\omega'}\|_2^2 = \gamma^2 |\omega - \omega'|_1$. Note that the
entries of the vectors $\omega, \omega'$ are either 0 or 1, therefore the $\ell_1$ distance between these vectors
coincides with the Hamming distance. According to the Varshamov-Gilbert lemma [26, Lemma
2.9], there exists a set $\Omega \subseteq \{0, 1\}^I$ of cardinality at least $2^{|I|/8}$ such that it contains the zero
element and the pairwise distances $|\omega - \omega'|_1$ are at least $|I|/8$ for any pair $\omega, \omega' \in \Omega$.

We can now apply Theorem 2.7 from [26] that asserts that if, for some $r > 0$, we have
$\min_{\omega \neq \omega' \in \Omega} \|f_\omega - f_{\omega'}\|_2 \ge 2r > 0$, and

$$ \frac{1}{|\Omega|} \sum_{\omega \in \Omega} \mathcal{K}(P_{f_\omega}, P_0) \ \le\ \frac{\log |\Omega|}{16}, \tag{23} $$

where $\mathcal{K}(\cdot, \cdot)$ denotes the Kullback-Leibler divergence, then $\inf_{\hat f} \max_{\omega \in \Omega} E_{f_\omega}[\|\hat f - f_\omega\|_2^2] \ge c' r^2$
for some absolute constant $c' > 0$. In our case, we set $r = \gamma \sqrt{|I|/32}$. Combining (22) and the
fact that the Kullback-Leibler divergence between the Gaussian measures $P_f$ and $P_g$ is given by
$\frac{1}{2}\varepsilon^{-2} \|f - g\|_2^2$, we obtain $\frac{1}{|\Omega|} \sum_{\omega \in \Omega} \mathcal{K}(P_{f_\omega}, P_0) \le \frac{1}{2\varepsilon^2} \gamma^2 |I|$. If $\gamma^2 \le (\log 2)\,\varepsilon^2/64$, then (23) is
satisfied and $r^2 = \gamma^2 (2t+1)^s/32$ is a lower bound on the rate of convergence of the minimax risk.

To finish the proof, it suffices to choose $t \in \mathbb{N}$ and $\gamma > 0$ satisfying the following three conditions:
$t \ge 4$, $\gamma^2 \le (2t+1)^{-2\beta-s}$ and $\gamma^2 \le \varepsilon^2 \log(2)/64$. For the choice $\gamma^{-2} = (2t+1)^{2\beta+s} + \varepsilon^{-2}\,64/\log(2)$
and $t = [4\varepsilon^{-2/(2\beta+s)}]$ all these conditions are satisfied and $r^2 \ge c_1 \varepsilon^{4\beta/(2\beta+s)}$ for some absolute
positive constant $c_1$.
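
The final choice of $t$ and $\gamma$ in the proof can be checked numerically. The sketch below takes $t = [4\varepsilon^{-2/(2\beta+s)}]$ and $\gamma^{-2} = (2t+1)^{2\beta+s} + 64\,\varepsilon^{-2}/\log 2$ and returns them together with $r^2 = \gamma^2 (2t+1)^s/32$. The function name `packing_parameters` is hypothetical, and the sketch assumes $L = 1$ and $0 < \varepsilon < 1$, as in the proof.

```python
import math


def packing_parameters(beta: float, s: int, eps: float):
    """Return (t, gamma^2, r^2) for the choice made at the end of the proof."""
    t = math.floor(4.0 * eps ** (-2.0 / (2.0 * beta + s)))
    gamma2 = 1.0 / ((2 * t + 1) ** (2.0 * beta + s) + 64.0 / (eps**2 * math.log(2.0)))
    r2 = gamma2 * (2 * t + 1) ** s / 32.0
    return t, gamma2, r2


if __name__ == "__main__":
    for beta, s, eps in [(1.0, 1, 0.5), (2.0, 3, 0.1), (0.5, 2, 0.01)]:
        t, gamma2, r2 = packing_parameters(beta, s, eps)
        ok = (
            t >= 4
            and gamma2 <= (2 * t + 1) ** (-(2.0 * beta + s))
            and gamma2 <= eps**2 * math.log(2.0) / 64.0
        )
        print(beta, s, eps, t, ok, r2 / eps ** (4 * beta / (2 * beta + s)))
```

The last printed column is the ratio of $r^2$ to $\varepsilon^{4\beta/(2\beta+s)}$, which gives a feel for the size of the constant for the chosen parameters.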



4.2. Lower bound for unknown structure $\eta$



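Proposition 2. Let the assumptions of Proposition 1 be satisfied. Then there exists an absolute
constant $C > 0$ such that

$$ \inf_{\hat f} \ \sup_{f \in \mathcal{F}_{s,m}(W(\beta,L))} E_f[\|\hat f - f\|_2^2] \ \ge\ C \min\Big\{ L,\ m s\,\varepsilon^2 \log\Big( \frac{d}{s\,m^{1/s}} \Big) \Big\}. $$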

Proof. We use again Theorem 2.7 in [26] but with a choice of the finite subset of $\mathcal{F}_{s,m}(W(\beta, L))$
different from that of Proposition 1. First, we introduce some additional notation. For every triplet
$(m, s, d) \in \mathbb{N}_*^3$ satisfying $ms \le d$, let $\mathcal{P}^d_{s,m}$ be the set of collections $\pi = \{V_1, \dots, V_m\}$ such that
each $V_\ell \subseteq \{1, \dots, d\}$ has exactly $s$ elements and the $V_\ell$'s are pairwise disjoint. We consider $\mathcal{P}^d_{s,m}$
as a metric space with the distance $\rho(\pi, \pi') = \frac{1}{m} \sum_{\ell=1}^{m} 1(V_\ell \notin \{V'_1, \dots, V'_m\}) = \frac{|\pi \setminus \pi'|}{m}$, where
$\pi' = \{V'_1, \dots, V'_m\} \in \mathcal{P}^d_{s,m}$. It is easy to see that $\rho(\cdot, \cdot)$ is a distance bounded by 1.

For any $\vartheta \in (0, 1)$, let $\mathcal{N}^d_{s,m}(\vartheta)$ denote the logarithm of the packing number, i.e., the
logarithm of the largest integer $K$ such that there are $K$ elements $\pi^{(1)}, \dots, \pi^{(K)}$ of $\mathcal{P}^d_{s,m}$ satisfying
$\rho(\pi^{(k)}, \pi^{(k')}) \ge \vartheta$. To each $\pi^{(k)}$ we associate a family of functions $\mathcal{U} = \{f_{k,\omega} : \omega \in \{-1, 1\}^{ms},\ k =
1, \dots, K\}$ defined by

$$ f_{k,\omega}(x) = \frac{\tau}{\sqrt{m}} \sum_{V \in \pi^{(k)}} \varphi_{\omega,V}(x_V), $$

where $\tau = (1/4) \min\big( \varepsilon \sqrt{ms \log 2 + \log K},\ \sqrt{L} \big)$ and $\varphi_{\omega,V}(x_V) = \prod_{j \in V} \varphi_{\omega_j}(x_j)$. Using that $\{\varphi_j\}$
is the tensor-product trigonometric basis, it is easy to see that each $f_{k,\omega}$ belongs to $\mathcal{F}_{s,m}(W(\beta, L))$.
Next, $|\mathcal{U}| = 2^{ms} K$ and, for any $f_{k,\omega} \in \mathcal{U}$, the Kullback-Leibler divergence between $P_{f_{k,\omega}}$ and $P_0$
is equal to $\mathcal{K}(P_{f_{k,\omega}}, P_0) = \frac{1}{2}\varepsilon^{-2} \|f_{k,\omega}\|_2^2 = \frac{\varepsilon^{-2}\tau^2}{2} \le \frac{\log |\mathcal{U}|}{32}$. Furthermore, the functions $f_{k,\omega}$ are not
too close to each other. Indeed, since $\{\varphi_j\}$ is the tensor-product trigonometric basis we get that,
for all $f_{k,\omega}, f_{k',\omega'} \in \mathcal{U}$,

$$ \begin{aligned} \|f_{k,\omega} - f_{k',\omega'}\|_2^2 &= \tau^2 m^{-1} \Big( 2m - 2 \sum_{V \in \pi^{(k)}} \sum_{V' \in \pi^{(k')}} \int \varphi_{\omega,V}(x_V)\,\varphi_{\omega',V'}(x_{V'})\,dx \Big) \\ &\ge \tau^2 \Big( 2 - \frac{2}{m} \sum_{V \in \pi^{(k)}} \sum_{V' \in \pi^{(k')}} 1(V = V') \Big) = 2\tau^2 \rho(\pi^{(k)}, \pi^{(k')}) \ge 2\vartheta\tau^2. \end{aligned} $$

These remarks and Theorem 2.7 in [26] imply that

$$ \inf_{\hat f} \ \sup_{f \in \mathcal{U}} E_f[\|\hat f - f\|_2^2] \ \ge\ c_3\,\vartheta\,\tau^2 = \frac{c_3 \vartheta}{16} \min\big\{ L,\ \varepsilon^2 (m s \log 2 + \log K) \big\} \tag{24} $$

for some absolute constant $c_3 > 0$. Assume first that $d < 4 s\,m^{1/s}$. Then $ms \log 2 \ge \frac{ms}{2} \log\big( \frac{d}{s\,m^{1/s}} \big)$
and the result of the proposition is straightforward. If $d \ge 4 s\,m^{1/s}$ we fix $\vartheta = 1/8$ and use the
following lemma (cf. the Appendix for a proof) to bound $\log K = \mathcal{N}^d_{s,m}(\vartheta)$ from below.

Lemma 2. For any § e (0, 1/8] we have N* m ($) > -mlog ( 8e7/ ^ 1/2 ) + i og (^^). 
This yields 

m S log2+A/; d m (tf) > ^log(^) -mlog((8/7) e 7 /V/ 2 ). (25) 

It is easy to check that m log ((8/7)e 7 / 8 s 1 / 2 ) < 1.01ms, while for d > 4sm 1 / s we have 
3 l°g ( S mV° ) — 1-15. Combining these inequalities with (|2~4"|) and (|23|) we get the result. 

5. Discussion and outlook 

We presented a new framework, called the compound functional model, for performing various 
statistical tasks such as prediction, estimation and testing in the context of high dimension. We 
studied the problem of estimation in this model from a minimax point of view when the data are 
generated by a Gaussian process. We established upper and lower bounds on the minimax risk 
that match up to a multiplicative constant. These bounds are nonasymptotic and are attained 
adaptively with respect to the macroscopic and microscopic sparsity parameters m and s, as well 
as to the complexity of the atoms of the model. In particular, we improve in several aspects 
upon the existing results for the sparse additive model, which is a special case of the compound 
functional model (only for this case the rates were previously explicitly treated in the literature) : 

— The exact expression for the optimal rate that we obtain reveals that the existing methods 
for the sparse additive model based on penalized least squares techniques have logarithmically 
suboptimal rates. 

— On the difference from most of the previous work, we do not require restricted isometry type 
assumptions on the subspaces of the additive model; we need only a much weaker one-sided 
condition (2). Possible extensions to the general compound model based on the existing literature
would again suffer from rate suboptimality and require extra conditions of this type.

— When specialized to the sparse additive model, our results are adaptive with respect to the 
smoothness of the atoms, while all the previous work about the rates considered the smoothness 
(or the reproducing kernel) as given in advance. 

For the general compound model, the main difficulty is in the proof of the lower bounds of the
order $ms\varepsilon^2\log(d/(sm^{1/s}))$ that are not covered by the standard tools such as the Varshamov-Gilbert
lemma or the $k$-selection lemma. Therefore, we developed here new tools for the lower bounds
that can be of independent interest.

An important issue that remained out of scope of the present work but is undeniably worth 
studying is the possibility of achieving the minimax rates by computationally tractable procedures. 
Clearly, the complexity of exact computation of the procedure described in Section 3 scales as
$\varepsilon^{-2m}2^{M_{d,s}}$, which is prohibitively large for typical values of $d$, $s$ and $m$. It is possible, however,
to approximate our estimator by using a Markov Chain Monte-Carlo (MCMC) algorithm similar
to that of [22, 23]. The idea is to begin with an initial state $(t_0,\eta_0)$ and to randomly generate
a new candidate $(u,\zeta)$ according to the distribution $q(\cdot\,|\,t_0,\eta_0)$, where $q(\cdot\,|\,\cdot)$ is a given Markov
kernel. Then, a Bernoulli random variable $\xi$ with probability of the output 1 equal to
$\alpha = 1 \wedge \frac{\hat\pi(u,\zeta)\,q(t_0,\eta_0\,|\,u,\zeta)}{\hat\pi(t_0,\eta_0)\,q(u,\zeta\,|\,t_0,\eta_0)}$
is drawn and a new state $(t_1,\eta_1) = \xi\cdot(u,\zeta) + (1-\xi)\cdot(t_0,\eta_0)$ is defined. This
procedure is repeated $K$ times, thus producing a realization $\{(t_k,\eta_k);\ k = 0,\dots,K\}$ of a reversible
Markov chain. Then, the average value $\frac{1}{K}\sum_{k=1}^K \hat f_{t_k,\eta_k}$ provides an approximation to the estimator
$\hat f_\varepsilon$ defined in Section 3.

If $s$ and $m$ are small and $q(\cdot\,|\,t,\eta')$ is such that all the mass of this distribution is concentrated on
the nearest neighbors of $\eta'$ in the hypercube of all $2^{M_{d,s}}$ possible $\eta$'s, then the computations
can be performed in polynomial time. For example, if $s = 2$, i.e., if we allow only pairwise
interactions, each step of the algorithm requires $\sim \varepsilon^{-2m}d^2$ computations, where the factor $\varepsilon^{-2m}$
can be reduced to a power of $\log(\varepsilon^{-2})$ by a suitable modification of the estimator. How fast such
MCMC algorithms converge to our estimator and what is the most appealing choice for the Markov
kernel $q(\cdot\,|\,\cdot)$ are challenging open questions for future research.
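
To make the accept/reject mechanism described above concrete, here is a minimal Python sketch of a Metropolis-type chain on a small hypercube with nearest-neighbor proposals. It is a toy illustration under our own assumptions, not the procedure of Section 3: the state is only a binary vector $\eta$ (the component $t$ is omitted), the functions `weight` and `estimate` are hypothetical stand-ins for the target weights and for the value of $\hat f_{t,\eta}$ at a fixed point, and the proposal flips one uniformly chosen coordinate, so that the kernel $q$ is symmetric and cancels in the acceptance probability.

```python
import math
import random
from typing import Callable, Tuple

State = Tuple[int, ...]  # a vertex eta of the hypercube {0,1}^n

def flip_one_coordinate(state: State, rng: random.Random) -> State:
    """Symmetric proposal: move to a uniformly chosen nearest neighbor."""
    i = rng.randrange(len(state))
    return state[:i] + (1 - state[i],) + state[i + 1:]

def metropolis_average(
    weight: Callable[[State], float],    # unnormalized, strictly positive target weight
    estimate: Callable[[State], float],  # quantity to average along the chain
    start: State,
    n_steps: int,
    rng: random.Random,
) -> float:
    """Average of estimate(state_k), k = 1..n_steps, along the chain."""
    state = start
    total = 0.0
    for _ in range(n_steps):
        candidate = flip_one_coordinate(state, rng)
        # Symmetric kernel: acceptance probability is 1 ^ (weight ratio).
        accept_prob = min(1.0, weight(candidate) / weight(state))
        if rng.random() < accept_prob:
            state = candidate
        total += estimate(state)
    return total / n_steps

# Toy target on {0,1}^3: weight proportional to exp(-number of ones).
approx = metropolis_average(
    weight=lambda eta: math.exp(-sum(eta)),
    estimate=lambda eta: float(sum(eta)),
    start=(0, 0, 0),
    n_steps=100_000,
    rng=random.Random(0),
)
# Under this product target each coordinate equals 1 with probability e^{-1}/(1+e^{-1}).
exact = 3 * math.exp(-1) / (1 + math.exp(-1))
print(approx, exact)
```

In this toy case the exact average is available in closed form, so the quality of the approximation can be inspected directly; for the actual estimator no such benchmark is available, which is precisely why the convergence speed is an open question.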



Appendix 

A. Proof of Lemma 1

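Let $\eta \in B$ be such that $f \in \mathcal{F}_\eta(W(\beta, L))$ and $\mathrm{supp}(\eta) = \{V_1,\dots,V_m\}$ where $|V_\ell| \le s$. Then
there exist a constant $\bar f$ and $m$ functions $f_1,\dots,f_m$ such that $f_\ell \in W_{V_\ell}(\beta, L)$, $\ell = 1,\dots,m$, and
$f = \bar f + f_1 + \dots + f_m$. Set $\theta_{j,\ell} = \theta_j[f_\ell]$, $(j,\ell) \in \mathbb{Z}^d \times \{1,\dots,m\}$. Using the notation $t_\ell = t_{V_\ell}$ and

\[ J = \big\{ j \in \mathbb{Z}^d : \exists\, \ell \in \{1,\dots,m\} \text{ such that } \mathrm{supp}(j) \subseteq V_\ell \text{ and } |j|_\infty \le t_\ell \big\} \]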

we get 



jeJ\o e=ijez d \o 

m 

e 3,m 1 {supp{j)CV t ;\j\ x >t l: } 

e=i jez d 

m 
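where $(\xi_j)_{j\in\mathbb{Z}^d}$ are i.i.d. Gaussian random variables with zero mean and variance one. In view of
the bias-variance decomposition and (2), we bound the risk of $\hat f_{t,\eta}$ as follows:

\[
\begin{aligned}
E_f\big[\|\hat f_{t,\eta} - f\|_2^2\big]
&\le E\sum_{j\in J}\varepsilon^2\xi_j^2 + C_*\sum_{\ell=1}^m\sum_{j\in\mathbb{Z}^d}\theta_{j,\ell}^2\,1\{\mathrm{supp}(j)\subseteq V_\ell;\ |j|_\infty > t_\ell\} \\
&\le \sum_{\ell=1}^m\sum_{j\in\mathbb{Z}^d}\varepsilon^2\,1\{\mathrm{supp}(j)\subseteq V_\ell;\ |j|_\infty \le t_\ell\} + C_*\sum_{\ell=1}^m\sum_{j\in\mathbb{Z}^d}\theta_{j,\ell}^2\,1\{\mathrm{supp}(j)\subseteq V_\ell;\ |j|_\infty > t_\ell\} \\
&\le m\max_{\ell=1,\dots,m}\sum_{j:\,\mathrm{supp}(j)\subseteq V_\ell}\big(\varepsilon^2\,1\{|j|_\infty \le t_\ell\} + C_*\theta_{j,\ell}^2\,1\{|j|_\infty > t_\ell\}\big).
\end{aligned}
\tag{26}
\]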

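In the right-hand side of (26), the first summand is the variance term, while the second summand
is the bias term of the risk. We bound these two terms separately. For the bias contribution to the
risk, we find:

\[
\sum_{j:\,\mathrm{supp}(j)\subseteq V_\ell}\theta_{j,\ell}^2\,1\{|j|_\infty > t_\ell\}
\le (t_\ell+1)^{-2\beta}\sum_{j:\,\mathrm{supp}(j)\subseteq V_\ell}|j|_\infty^{2\beta}\,\theta_{j,\ell}^2
\le L(t_\ell+1)^{-2\beta}
\le 3^{2\beta\wedge s}\big(L\varepsilon^{4\beta}\vee L^{s/(2\beta+s)}\varepsilon^{4\beta/(2\beta+s)}\big). \tag{27}
\]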
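If $t_\ell \ge 1$, then the variance contribution to the risk is bounded as follows:

\[
\sum_{j:\,\mathrm{supp}(j)\subseteq V_\ell}\varepsilon^2\,1\{|j|_\infty \le t_\ell\}
= \varepsilon^2(2t_\ell+1)^{|V_\ell|}
\le \varepsilon^2(3t_\ell)^{|V_\ell|}
\le 3^{2\beta\wedge s}L^{|V_\ell|/(2\beta+|V_\ell|)}\varepsilon^{4\beta/(2\beta+|V_\ell|)}, \tag{28}
\]

where we have used that $t_\ell \le \big(L/(3^{|V_\ell|}\varepsilon^2)\big)^{1/(2\beta+|V_\ell|)}$ and $|V_\ell| \le s$. Finally, note that the condition
$\log(\varepsilon^{-2}) \ge (2\beta)^{-1}\log(L)$ implies that $L\varepsilon^{4\beta} \le L^{s/(2\beta+s)}\varepsilon^{4\beta/(2\beta+s)}$ in (27). Thus, inequality (26)
together with (27) and (28) yields the lemma in the case $t_\ell \ge 1$. If $t_\ell < 1$, i.e., $t_\ell = 0$, the
same arguments imply that the bias is bounded by $L$ and the variance is bounded by $\varepsilon^2$. Since
$L \ge \varepsilon^2$, the right-hand side of (26) is bounded by $(1 + C_*)L$. One can check that $t_\ell$ equals $0$
only if $L < 3^s\varepsilon^2$, and in this case

\[ L = L^{s/(2\beta+s)}L^{2\beta/(2\beta+s)} \le L^{s/(2\beta+s)}\varepsilon^{4\beta/(2\beta+s)}3^{2\beta s/(2\beta+s)} \le 3^{2\beta\wedge s}L^{s/(2\beta+s)}\varepsilon^{4\beta/(2\beta+s)}. \]

This completes the proof.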



B. Proof of Lemma 2

Prior to presenting a proof of Lemma 2 we need an additional result.

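Lemma 3. For a triplet $(m,s,d) \in \mathbb{N}^3$ satisfying $ms \le d$, let $\mathcal{P}^d_{s,m}$ be the set of all collections
$\pi = \{A_1,\dots,A_m\}$ with $A_i \subseteq \{1,\dots,d\}$ such that $|A_i| = s$ for all $i$ and $A_i \cap A_k = \emptyset$ for $i \ne k$.
Then

\[ s^{-(m-1)/2}\Big(\frac{d}{sm^{1/s}}\Big)^{ms} \le |\mathcal{P}^d_{s,m}| \le \Big(\frac{e^2 d}{sm^{1/s}}\Big)^{ms}. \]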

Proof. Using standard combinatorial arguments we find

\[ |\mathcal{P}^d_{s,m}| = \binom{d}{ms}\frac{(ms)!}{(s!)^m\,m!} \ge \Big(\frac{d}{ms}\Big)^{ms}\frac{(ms)!}{(s!)^m\,m!}. \]

If either $s = 1$ or $m = 1$ then $(ms)! = (s!)^m\,m!$ and the lower bound stated in the lemma is
obviously true. Assume now that $m \ge 2$ and $s \ge 2$. Recall that according to the Stirling formula,
for every $n \in \mathbb{N}$, $\sqrt{2\pi n}\,(n/e)^n \le n! \le \sqrt{2\pi n}\,(n/e)^n e^{1/(12n)}$. Therefore,

\[
\frac{(ms)!}{m!\,(s!)^m}
\ge \frac{\sqrt{2\pi ms}\,(ms/e)^{ms}}{\sqrt{2\pi m}\,(m/e)^m\,e^{\frac{1}{12m}+\frac{m}{12s}}\,(\sqrt{2\pi s})^m\,(s/e)^{ms}}
= \Big[\frac{e^{\,m-\frac{1}{12m}-\frac{m}{12s}}}{(2\pi)^{m/2}}\Big]\,m^{(s-1)m}\,s^{-(m-1)/2}.
\]

Since the expression in square brackets in the last display is greater than 1 we obtain the desired
lower bound on $|\mathcal{P}^d_{s,m}|$. The upper bound follows from (18) and the fact that $|\mathcal{P}^d_{s,m}| \le |B^d_{s,m}|$.
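
The counting formula and the lower bound $s^{-(m-1)/2}\big(d/(sm^{1/s})\big)^{ms}$ obtained in this proof can be checked on small instances. The Python sketch below is only an illustration (the function names are ours, and the range $d \le 12$ is an arbitrary choice): it compares the closed form $\binom{d}{ms}(ms)!/((s!)^m m!)$ with a brute-force enumeration and then looks for triplets violating the lower bound.

```python
from itertools import combinations
from math import comb, factorial

def count_collections(d: int, s: int, m: int) -> int:
    """Closed form C(d, ms) (ms)! / ((s!)^m m!)."""
    return comb(d, m * s) * factorial(m * s) // (factorial(s) ** m * factorial(m))

def brute_force_count(d: int, s: int, m: int) -> int:
    """Count collections of m pairwise disjoint s-subsets of {1, ..., d} by enumeration."""
    blocks = [frozenset(c) for c in combinations(range(1, d + 1), s)]
    return sum(
        1
        for choice in combinations(blocks, m)
        if len(frozenset().union(*choice)) == m * s  # disjoint iff the union has ms elements
    )

def lower_bound(d: int, s: int, m: int) -> float:
    """The quantity s^{-(m-1)/2} (d / (s m^{1/s}))^{ms}."""
    return s ** (-(m - 1) / 2) * (d / (s * m ** (1 / s))) ** (m * s)

# Triplets (d, s, m) with ms <= d for which the lower bound would fail
# (a tiny relative tolerance absorbs floating-point rounding in equality cases).
violations = [
    (d, s, m)
    for d in range(1, 13)
    for s in range(1, d + 1)
    for m in range(1, d // s + 1)
    if lower_bound(d, s, m) > count_collections(d, s, m) * (1 + 1e-9)
]
print(count_collections(6, 2, 2), brute_force_count(6, 2, 2), violations)
```

For $d = 6$, $s = 2$, $m = 2$ both counts equal 45, and the list of violations is expected to be empty on this range.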



Proof of Lemma 2. Consider first the case $m = 1$. The set $\mathcal{P}^d_{s,1}$ is the collection of all subsets of
$\{1,\dots,d\}$ having exactly $s$ elements. The distance $\rho$ is then $0$ if the sets coincide and $1$ otherwise.
Thus, we need to bound from below the logarithm of $|\mathcal{P}^d_{s,1}| = \binom{d}{s}$. It is enough to use the inequality
$\log\binom{d}{s} \ge s\log(d/s)$.

Assume now that $m \ge 2$. Since $\pi^{(1)},\dots,\pi^{(K)}$ is a maximal $\vartheta$-separated set of $\mathcal{P}^d_{s,m}$ we have that
$\mathcal{P}^d_{s,m}$ is covered by the union of $\rho$-balls $B(\pi^{(k)},\vartheta)$ of radius $\vartheta$ centered at the $\pi^{(k)}$'s. Therefore,

\[ |\mathcal{P}^d_{s,m}| \le \sum_{k=1}^K |B(\pi^{(k)},\vartheta)|. \]

It is clear that the cardinality of the ball $|B(\pi^{(k)},\vartheta)|$ does not depend on $\pi^{(k)}$. This yields

\[ K \ge \frac{|\mathcal{P}^d_{s,m}|}{|B(\pi^0,\vartheta)|}, \]

where $\pi^0 = \{A^0_1,\dots,A^0_m\}$ with $A^0_i = \{(i-1)s+1,\dots,is\}$. We have already established
a lower bound on $|\mathcal{P}^d_{s,m}|$ in Lemma 3. We now find an upper bound on the cardinality of the
ball $B(\pi^0,\vartheta)$. Let $m_\vartheta$ be the smallest integer greater than or equal to $(1-\vartheta)m$. Consider some
$\pi = \{A_1,\dots,A_m\} \in \mathcal{P}^d_{s,m}$. Note that $\pi \in B(\pi^0,\vartheta)$ if and only if

\[ \sum_{i=1}^m 1\big(A^0_i \in \{A_1,\dots,A_m\}\big) \ge m_\vartheta. \]

This means that there are $m_\vartheta$ indexes $i_1,\dots,i_{m_\vartheta}$ such that the $m_\vartheta$ sets $A^0_{i_j}$ are in $\pi$ and the
remaining $m - m_\vartheta$ elements of $\pi$ are chosen as an arbitrary collection of $m - m_\vartheta$ disjoint subsets of
$\{1,\dots,d\}\setminus\bigcup_{j=1}^{m_\vartheta}A^0_{i_j}$, each of which is of cardinality $s$. There are $\binom{m}{m_\vartheta}$ ways of choosing $\{i_1,\dots,i_{m_\vartheta}\}$
and once this choice is fixed, there are $|\mathcal{P}^{d-sm_\vartheta}_{s,m-m_\vartheta}|$ ways of choosing the remaining parts. Thus,
$|B(\pi^0,\vartheta)| \le \binom{m}{m_\vartheta}|\mathcal{P}^{d-sm_\vartheta}_{s,m-m_\vartheta}|$. Using this inequality and Lemma 3 we obtain

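\[
\begin{aligned}
K \ge \frac{|\mathcal{P}^d_{s,m}|}{|B(\pi^0,\vartheta)|}
&\ge s^{-(m-1)/2}\Big(\frac{d}{sm^{1/s}}\Big)^{ms}\Big(\frac{em}{m_\vartheta}\Big)^{-m_\vartheta}\Big(\frac{e^2(d-sm_\vartheta)}{s(m-m_\vartheta)^{1/s}}\Big)^{-s(m-m_\vartheta)} \\
&= s^{-(m-1)/2}\,e^{2s(m_\vartheta-m)-m_\vartheta}\Big(\frac{d}{sm^{1/s}}\Big)^{ms}\Big(\frac{m_\vartheta}{m}\Big)^{m_\vartheta}\Big(\frac{s(m-m_\vartheta)^{1/s}}{d-sm_\vartheta}\Big)^{s(m-m_\vartheta)}.
\end{aligned}
\]

Since $\vartheta \le 1/8$ we have $m_\vartheta \ge m(1-\vartheta) \ge 7m/8$ and after some algebra we deduce from the
previous display that

\[ \log K \ge -\frac{ms}{4} - m\log\Big(\frac{8e^{7/8}s^{1/2}}{7}\Big) + \frac{7ms}{8}\log\Big(\frac{d}{sm^{1/s}}\Big). \tag{29} \]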
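
The counting argument for the ball around $\pi^0$ can be illustrated by brute force on a tiny instance. The Python sketch below is an illustration under our own naming (`enumerate_collections`, `ball_size_and_bound` are hypothetical helpers): it counts the collections sharing at least $m_\vartheta$ sets with $\pi^0$ and compares this number with the product of the binomial coefficient $\binom{m}{m_\vartheta}$ and the number of collections of $m - m_\vartheta$ disjoint $s$-subsets of the remaining $d - sm_\vartheta$ points. The value $\vartheta = 1/2$ is used only to get a nontrivial ball; the lemma itself concerns $\vartheta \le 1/8$.

```python
from fractions import Fraction
from itertools import combinations
from math import ceil, comb, factorial

def count_collections(d: int, s: int, m: int) -> int:
    """Number of collections of m pairwise disjoint s-subsets of a d-point set."""
    return comb(d, m * s) * factorial(m * s) // (factorial(s) ** m * factorial(m))

def enumerate_collections(d: int, s: int, m: int):
    """All collections of m pairwise disjoint s-subsets of {1, ..., d}."""
    blocks = [frozenset(c) for c in combinations(range(1, d + 1), s)]
    return [
        frozenset(choice)
        for choice in combinations(blocks, m)
        if len(frozenset().union(*choice)) == m * s
    ]

def ball_size_and_bound(d: int, s: int, m: int, theta: Fraction):
    """Exact size of the ball around pi^0 and the counting upper bound on it."""
    m_theta = ceil((1 - theta) * m)  # smallest integer >= (1 - theta) m
    centre = frozenset(
        frozenset(range((i - 1) * s + 1, i * s + 1)) for i in range(1, m + 1)
    )
    size = sum(
        1 for pi in enumerate_collections(d, s, m) if len(pi & centre) >= m_theta
    )
    bound = comb(m, m_theta) * count_collections(d - s * m_theta, s, m - m_theta)
    return size, bound

print(ball_size_and_bound(6, 2, 2, Fraction(1, 2)))
print(ball_size_and_bound(6, 2, 2, Fraction(1, 8)))
```

The bound overcounts because a collection sharing more than $m_\vartheta$ sets with $\pi^0$ is counted once for each admissible choice of indexes, which is harmless for an upper bound.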

Assume first that $s \ge 3$. Since also $m \ge 2$ we have $2^{1-1/3} \le m^{1-1/s} = \frac{m}{m^{1/s}} \le \frac{d}{sm^{1/s}}$. Hence

\[ \frac{ms}{4} \le \frac{ms}{\log(2^{8/3})}\log\Big(\frac{d}{sm^{1/s}}\Big) \]

and the result of the lemma follows from the inequality $7/8 - 1/\log(2^{8/3}) \ge 1/3$. It remains to
consider the case $s \in \{1,2\}$. If the right-hand side of the inequality of the lemma is negative, then
the result is trivial. If the right-hand side is positive, we have $\log(8e^{7/8}/7) \le \frac{s}{3}\log\big(\frac{d}{sm^{1/s}}\big)$ for
$s \in \{1,2\}$. Therefore, from (29) we obtain

\[ \log(K) \ge -\frac{ms}{6\log(8e^{7/8}/7)}\log\Big(\frac{d}{sm^{1/s}}\Big) - m\log\Big(\frac{8e^{7/8}s^{1/2}}{7}\Big) + \frac{7ms}{8}\log\Big(\frac{d}{sm^{1/s}}\Big) \]

and the result of the lemma follows from the inequality $7/8 - (6\log(8e^{7/8}/7))^{-1} \ge 1/2$.
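
The two numerical inequalities that close the proof are easy to confirm. The short Python sketch below is a sanity check only (the variable names are ours); it evaluates $7/8 - 1/\log(2^{8/3})$ and $7/8 - (6\log(8e^{7/8}/7))^{-1}$.

```python
import math

# Case s >= 3: coefficient left in front of ms * log(d / (s m^{1/s})).
gap_large_s = 7 / 8 - 1 / math.log(2 ** (8 / 3))

# Case s in {1, 2}: coefficient left in front of ms * log(d / (s m^{1/s})).
gap_small_s = 7 / 8 - 1 / (6 * math.log(8 * math.exp(7 / 8) / 7))

print(gap_large_s, gap_small_s)
```

The first quantity exceeds $1/3$ only by a small margin, so the constant $1/3$ in Lemma 2 is essentially what this argument delivers for $s \ge 3$.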